Abstract

We propose an effective structured learning based approach to the problem of
person re-identification which outperforms the current state-of-the-art on
most benchmark data sets evaluated. Our framework is built on the basis of
multiple low-level hand-crafted and high-level visual features. We then
formulate two optimization algorithms which directly optimize the evaluation
measure commonly used in person re-identification, the Cumulative Matching
Characteristic (CMC) curve. Our new approach is practical for many real-world
surveillance applications as the re-identification performance can be
concentrated in the range of most practical importance. The combination of
these factors leads to a person re-identification system which outperforms
most existing algorithms. More importantly, we advance state-of-the-art
results on person re-identification by improving the rank-1 recognition rates
from 40% to 50% on the iLIDS benchmark, 16% to 18% on the PRID2011 benchmark,
43% to 46% on the VIPeR benchmark, 34% to 53% on the CUHK01 benchmark and 21%
to 62% on the CUHK03 benchmark.

1. Introduction

The task of person re-identification (re-id) is to match pedestrian images
observed from multiple cameras. It has recently gained popularity in the
research community due to its several important applications in video
surveillance. An automated re-id system could save a lot of human labour in
exhaustively searching for a person of interest in a large amount of video
sequences.

Despite several years of research in the computer vision community, person
re-id is still a very challenging task and remains unsolved due to (a) large
variation in visual appearance (a person's appearance often undergoes large
variations across different camera views); (b) significant changes in human
pose at the time the image was captured; (c) large illumination changes; and
(d) background clutter and occlusions. Moreover, the problem becomes
increasingly difficult when persons share similar appearance, e.g., people
wearing similar clothing styles with similar colors.

To address these challenges, existing research on this topic has concentrated
on the development of sophisticated and robust features to describe the visual
appearance under significant changes. However, a system that relies heavily on
one specific type of visual cue, e.g., color, texture or shape, would not be
practical and powerful enough to discriminate individuals with similar visual
appearance. Existing studies have tried to address the above problem by
seeking a combination of robust and distinctive feature representations of a
person's appearance, ranging from color histogram, spatial co-occurrence
representation [?], LBP, color SIFT [?], etc.

One simple approach to exploit multiple visual features is to build an
ensemble of distance functions, in which each distance function is learned
using a single feature and the final distance is calculated from a weighted
sum of these distance functions [6, 38, 40]. However, existing works on person
re-id often pre-define these weights, which need to be re-estimated beforehand
for different data sets. Since different re-id benchmark data sets can have
very different characteristics, such as variation in view angle, lighting and
occlusion, combining multiple distance functions using pre-determined weights
is undesirable, as highly discriminative features in one environment might
become irrelevant in another environment.

In this paper, we introduce effective approaches to learn weights of these
distance functions. The first approach optimizes the relative distance using
the triplet information and the second approach maximizes the average rank-k
recognition rate, in which k is chosen to be small, e.g., k < 10. Setting the
value of k to be small is crucial for many real-world applications since most
surveillance operators typically inspect only the first ten or twenty items
retrieved.

The main contributions of this paper are twofold: 1) We propose two principled
approaches to build an ensemble of person re-id algorithms. The first approach
aims at maximizing the relative distance between images of different
individuals and images of the same individual such that the CMC curve
approaches one with a minimal number of returned candidates. The second
approach directly optimizes the probability that any of the top k matches are
correct using structured learning. Our ensemble-based approaches are highly
flexible and can be combined with linear and non-linear metrics. 2) Extensive
experiments are carried out to demonstrate that by building an ensemble of
person re-id algorithms learned from different visual features, notable
improvement on rank-1 recognition rate can be obtained. Experimental results
show that our approach achieves state-of-the-art performance on most person
re-id benchmark data sets evaluated. In addition, our ensemble approach is
complementary to any existing distance learning methods.

Related work. Existing person re-id systems consist of two major components:
feature representation and metric learning. In feature representation, robust
and discriminative features are constructed such that they can be used to
describe the appearance of the same individual across different camera views
under various changes and conditions [?]. We briefly discuss some of this work
below. More feature representations, which have been applied in person re-id,
can be found in [?].

Bazzani et al. represent a person by a global mean color histogram and
recurrent local patterns through epitomic analysis [?]. Farenzena et al.
propose the symmetry-driven accumulation of local features which exploits both
symmetry and asymmetry, and represents each part of a person by a weighted
color histogram, maximally stable color regions and texture information [?].
Gray and Tao introduce an ensemble of local features which combines three
color channels with 19 texture channels [?]. Schwartz and Davis propose a
discriminative appearance based model using partial least squares, in which
multiple visual features (texture, gradient and color features) are combined
[30]. Zhao et al. propose dColorSIFT, which combines SIFT features with color
histograms. The same authors also propose mid-level filters for person
re-identification by exploring the partial area under the ROC curve (pAUC)
score [?].


A large number of metric learning and ranking algorithms have been proposed
[?]. Many of these have been applied to the problem of person re-id. We
briefly review some of these algorithms. Interested readers should see [?].
Chopra et al. propose an algorithm to learn a similarity metric from data [?].
The authors train a convolutional network that maps input images into a target
space such that the $\ell_1$-norm in the target space approximates the
semantic distance in the image space. Gray and Tao use AdaBoost to select
discriminative features [?]. Koestinger et al. propose large-scale metric
learning from equivalence constraints, which considers a log likelihood ratio
test of two Gaussian distributions [17]. Li et al. propose the learning of
locally adaptive decision functions, which can be viewed as a joint model of a
distance metric and a locally adapted thresholding rule [?]. Li et al. propose
a filter pairing neural network to learn visual features for the
re-identification task from image data [20]. Pedagadi et al. combine color
histograms with supervised Local Fisher Discriminant Analysis [26]. Prosser et
al. use pairs of similar and dissimilar images and train the ensemble RankSVM
such that the true match gets the highest rank [27]. Weinberger et al. propose
the large margin nearest neighbour (LMNN) algorithm to learn the Mahalanobis
distance metric, which improves k-nearest neighbour classification [35]. LMNN
is later applied to the task of person re-identification in [?]. Wu et al.
apply the Metric Learning to Rank (MLR) method of [?] to person re-id [37].


Although a large number of existing algorithms have exploited state-of-the-art
visual features and advanced metric learning algorithms, we observe that the
best obtained overall performance on commonly evaluated person re-id
benchmarks, e.g., iLIDS and VIPeR, is still far from the performance needed
for most real-world surveillance applications.


Notation. Bold lower-case letters, e.g., $\mathbf{w}$, denote column vectors
and bold upper-case letters, e.g., $\mathbf{P}$, denote matrices. We assume
that the provided training data is for the task of single-shot person
re-identification, i.e., there exist only two images of the same person: one
image taken from camera view A and another image taken from camera view B. We
represent a set of training samples by $\{(x_i, x_i^+)\}_{i=1}^{m}$, where
$x_i \in \mathbb{R}^D$ represents a training example from one camera (i.e.,
camera view A), and $x_i^+$ is the corresponding image of the same person from
a different camera (i.e., camera view B). Here $m$ is the number of persons in
the training data. From the given training data, we can generate a set of
triplets for each sample $x_i$ as $\{(x_i, x_i^+, x_{i,j}^-)\}$ for
$j = 1, \dots, m$ and $i \neq j$. Here we introduce
$x_{i,j}^- \in \mathcal{X}_i^-$, where $\mathcal{X}_i^-$ denotes a subset of
images of persons with a different identity to $x_i$ from camera view B. We
also assume that there exists a set of distance functions $d_t(\cdot,\cdot)$,
$t = 1, \dots, T$, which calculate the distance between two given inputs. Our
goal is to learn a weighted distance function,
$d(\cdot,\cdot) = \sum_{t=1}^{T} w_t d_t(\cdot,\cdot)$, such that the distance
between $x_i$ (taken from camera view A) and $x_i^+$ (taken from camera view
B) is smaller than the distance between $x_i$ and any $x_{i,j}^-$ (taken from
camera view B). The better the distance function, the faster the cumulative
matching characteristic (CMC) curve approaches one.


2. Our Approach

In this section, we propose two approaches that can learn an ensemble of base
metrics. We then discuss base metrics and visual features that will be used in
our experiment.

2.1. Ensemble of base metrics

The most commonly used performance measure for evaluating person re-id is
known as the cumulative matching characteristic (CMC) curve, which is
analogous to the ROC curve in detection problems. The CMC curve represents
results of an identification task by plotting the probability of correct
identification (y-axis) against the number of candidates returned (x-axis).
The faster the CMC curve approaches one, the better the person re-id
algorithm. Since a better rank-1 recognition rate is often preferred [?], our
aim is to improve the recognition rate among the k best candidates, e.g.,
k < 20, which is crucial for many real-world surveillance applications. Note
that, in practice, a system that achieves the best recognition rate when k is
large (e.g., k > 100) is of little interest since most users inspect or
consider only the first ten or twenty returned candidates.

In this section, we propose two different approaches which learn an ensemble
of base metrics (discussed in the next section). The first approach,
CMC^triplet, aims at minimizing the number of returned candidates needed to
achieve a perfect identification, i.e., minimizing k such that the rank-k
recognition rate is equal to one. The second approach, CMC^top, optimizes the
probability that any of the k best matches are correct.

2.1.1 Relative distance based approach (CMC^triplet)

In order to minimize $k$ such that the rank-$k$ recognition rate is equal to
100%, we consider learning an ensemble of distance functions based on relative
comparison of triplets [29]. Given a set of triplets
$\{(x_i, x_i^+, x_{i,j}^-)\}$, in which $x_i$ is taken from camera view A and
$\{x_i^+, x_{i,j}^-\}$ are taken from camera view B, the basic idea is to
learn a distance function such that images of the same individual are closer
than any images of different individuals, i.e., $x_i$ is closer to $x_i^+$
than to any $x_{i,j}^-$. For a triplet $(x_i, x_i^+, x_{i,j}^-)$, the
following condition must hold:
$d(x_i, x_{i,j}^-) > d(x_i, x_i^+), \ \forall j, \ i \neq j$. Following the
large margin framework with the hinge loss, the condition
$d(x_i, x_{i,j}^-) \geq 1 + d(x_i, x_i^+)$ should be satisfied. This condition
means that the distance between two images of different individuals should be
larger by at least a unit than the distance between two images of the same
individual. Since the above condition cannot be satisfied by all triplets, we
introduce a slack variable to enable a soft margin. By generalizing the above
idea to the entire training set, the primal problem that we want to optimize
can be written as

$$
\begin{aligned}
\min_{w, \xi} \quad & \frac{1}{2}\|w\|_2^2 + \nu \, \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j=1}^{m-1} \xi_{ij} \\
\text{s.t.} \quad & w^\top (d_j^- - d_i^+) \geq 1 - \xi_{ij}, \ \forall \{i, j\}; \\
& w \geq 0; \ \xi \geq 0.
\end{aligned}
\tag{1}
$$


Here $\nu > 0$ is the regularization parameter,
$d_j^- = [d_1(x_i, x_{i,j}^-), \cdots, d_T(x_i, x_{i,j}^-)]$,
$d_i^+ = [d_1(x_i, x_i^+), \cdots, d_T(x_i, x_i^+)]$ and
$\{d_1(\cdot,\cdot), \cdots, d_T(\cdot,\cdot)\}$ represent a set of base
metrics. Note that we introduce the regularization term $\|w\|_2^2$ to avoid
the trivial solution of arbitrarily large $w$.
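
To make the objective in (1) concrete, the following minimal sketch evaluates it for a given weight vector, with each slack variable set to its optimal hinge value. It is an illustration only: the array layout (`d_pos` of shape m x T, `d_neg` of shape m x (m-1) x T) and the function names are assumptions made here, not part of the method description.

```python
import numpy as np

def ensemble_distance(w, d):
    """Weighted distance sum_t w_t * d_t; the last axis of d indexes base metrics."""
    return d @ w

def triplet_objective(w, d_pos, d_neg, nu):
    """Value of objective (1) with slacks at their hinge values.

    d_pos: (m, T) base-metric distances between x_i and its true match.
    d_neg: (m, m-1, T) base-metric distances between x_i and each impostor.
    """
    # w^T (d_j^- - d_i^+) for every triplet, shape (m, m-1)
    margins = (d_neg - d_pos[:, None, :]) @ w
    # xi_ij = max(0, 1 - margin) is the smallest slack satisfying the constraint
    slack = np.maximum(0.0, 1.0 - margins)
    # slack.mean() equals 1/(m(m-1)) * sum_ij xi_ij
    return 0.5 * (w @ w) + nu * slack.mean()
```

A triplet contributes zero loss once the impostor is at least one unit farther away than the true match, which is the soft-margin condition described above.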

We point out here that any smooth convex loss function can also be applied.
Suppose $\lambda(\cdot)$ is a smooth convex function defined in $\mathbb{R}$
and $\omega(\cdot)$ is any regularization function. The above optimization
problem, which enforces the relative comparison of the triplet, can also be
written as

$$
\begin{aligned}
\min_{w} \quad & \omega(w) + \nu \sum_{\tau} \lambda(\rho_\tau) \\
\text{s.t.} \quad & \rho_\tau = \sum_{t} w_t d_t(x_i, x_{i,j}^-) - \sum_{t} w_t d_t(x_i, x_i^+), \ \forall \tau; \\
& w \geq 0,
\end{aligned}
\tag{2}
$$

where $\tau$ indexes the set of triplets. In this paper, we consider the hinge
loss, but other convex loss functions [?] can be applied.

Since the number of constraints in (1) is quadratic in the number of training
examples, directly solving (1) using off-the-shelf optimization toolboxes can
only solve problems with up to a few thousand training examples. In the
following, we present an equivalent reformulation of (1), which can be
efficiently solved in linear runtime using cutting-plane algorithms. We first
reformulate (1) by writing it as

$$
\begin{aligned}
\min_{w, \xi} \quad & \frac{1}{2}\|w\|_2^2 + \nu \xi \\
\text{s.t.} \quad & \frac{1}{m(m-1)} \, w^\top \Big[ \sum_{i=1}^{m} \sum_{j=1}^{m-1} (d_j^- - d_i^+) \Big] \geq 1 - \xi, \\
& w \geq 0; \ \xi \geq 0.
\end{aligned}
\tag{3}
$$

Note that the new formulation has a single slack variable. Later on in this
section, we show how the cutting-plane method can be applied to solve (3).
2.1.2 Top recognition at rank-k (CMC^top)

Our previous formulation assumes that, for any triplet, images belonging to
the same individual should be closer than images belonging to different
individuals. Our second formulation is motivated by the nature of the problem,
in which person re-id users often browse only the first few retrieved matches.
Hence we propose another approach, in which the objective is no longer to
minimize k (the number of returned matches before achieving a 100% recognition
rate), but to maximize the correct identification among the top k best
candidates. Building upon the structured learning framework [15, 24], we
optimize the performance measure commonly used in the CMC curve (recognition
rate at rank-k). The difference between our work and [24] is that [24] assumes
training samples consist of $m_+$ positive instances and $m_-$ negative
instances, while our work assumes that there are $m$ individuals in camera
view A and $m$ individuals in camera view B. However, ranking is involved in
both works: [24] attempts to rank all positive samples before a subset of
negative samples, while our work attempts to rank a pair of the same
individual above a pair of different individuals. Both also apply the
structured learning of [?] to solve the optimization problem.

Given the training individual $x_i$ (from camera view A) and its correct match
$x_i^+$ from camera view B, we can represent the relative ordering of all
matching candidates in camera view B via a vector $p \in \{0,1\}^{m'}$, in
which $p_j$ is 0 if $x_i^+$ is ranked above $x_{i,j}^-$ and 1 if $x_i^+$ is
ranked below $x_{i,j}^-$. Here $m'$ is the total number of individuals from
camera view B who have a different identity to $x_i$. Since there exists only
one image of the same individual in camera view B, $m'$ is equal to $m - 1$,
where $m$ is the total number of individuals in the training set. We
generalize this idea to the entire training set and represent the relative
ordering via a matrix $P \in \{0,1\}^{m \times m'}$ as follows:

$$
p_{i,j} =
\begin{cases}
0 & \text{if } x_i^+ \text{ is ranked above } x_{i,j}^-, \\
1 & \text{otherwise.}
\end{cases}
\tag{4}
$$

The correct relative ordering of $P$ can be defined as $P^*$, where
$p^*_{i,j} = 0, \ \forall i, j$. The loss among the top $k$ candidates can
then be written as

$$
\Delta(P^*, P) = \frac{1}{m \cdot k} \sum_{i=1}^{m} \sum_{j=1}^{k} p_{i,(j)},
\tag{5}
$$

where $(j)$ denotes the index of the retrieved candidate ranked in the $j$-th
position among all top $k$ best candidates. We define the joint feature map,
$\psi$, of the form:

$$
\psi(S, P) = \frac{1}{m \cdot k} \sum_{i=1}^{m} \sum_{j=1}^{m'} (1 - p_{i,j}) (d_j^- - d_i^+),
\tag{6}
$$

where $S$ represents a set of triplets generated from the training data,
$d_j^- = [d_1(x_i, x_{i,j}^-), \cdots, d_T(x_i, x_{i,j}^-)]$ and
$d_i^+ = [d_1(x_i, x_i^+), \cdots, d_T(x_i, x_i^+)]$. The choice of
$\psi(S, P)$ guarantees that the variable $w$, which optimizes
$w^\top \psi(S, P)$, will also produce the distance function
$d(\cdot,\cdot) = \sum_{t=1}^{T} w_t d_t(\cdot,\cdot)$ that achieves the
optimal average recognition rate among the top $k$ candidates. The above
problem can be summarized as the following convex optimization problem:

$$
\begin{aligned}
\min_{w, \xi} \quad & \frac{1}{2}\|w\|_2^2 + \nu \xi \\
\text{s.t.} \quad & w^\top \big( \psi(S, P^*) - \psi(S, P) \big) \geq \Delta(P^*, P) - \xi,
\end{aligned}
\tag{7}
$$

$\forall P$ and $\xi \geq 0$. Here $P^*$ denotes the correct relative ordering
and $P$ denotes any arbitrary ordering. Similar to CMC^triplet, we use the
cutting-plane method to solve (7).

2.1.3 Cutting-plane optimization

In this section, we illustrate how the cutting-plane method can be used to
solve both optimization problems: (3) and (7). The key idea of the
cutting-plane method is that a small subset of the constraints is sufficient
to find an $\epsilon$-approximate solution to the original problem. The
cutting-plane algorithm begins with an empty initial constraint set and
iteratively adds the most violated constraint. At each iteration, the
algorithm computes the solution over the current working set. The algorithm
then finds the most violated constraint and adds it to the working set. The
cutting-plane algorithm continues until no constraint is violated by more than
$\epsilon$. Since the quadratic program is of constant size, the cutting-plane
method converges in a constant number of iterations. We present our proposed
CMC^top in Algorithm 1. The optimization problem for finding the most violated
constraint (Algorithm 1, step ②) can be written as

$$
\begin{aligned}
\bar{P} &= \arg\max_{P} \ \Delta(P^*, P) - w^\top \big( \psi(S, P^*) - \psi(S, P) \big) \\
&= \arg\max_{P} \ \Delta(P^*, P) - \frac{1}{m k} \sum_{i=1}^{m} \Big( \sum_{j=1}^{k} p_{i,(j)} \, w^\top d^{\pm}_{i,(j)} + \sum_{j=k+1}^{m'} p_{i,(j)} \, w^\top d^{\pm}_{i,(j)} \Big),
\end{aligned}
\tag{8}
$$

where $d^{\pm}_{i,(j)} = d^-_{(j)} - d_i^+$. Since each $p_{i,j}$ in (8) is
independent, (8) can be solved by maximizing over each element $p_{i,j}$.
Hence the $\bar{P}$ that most violates the constraint corresponds to

$$
\bar{p}_{i,(j)} =
\begin{cases}
\mathbb{1}\big( w^\top (d^-_{(j)} - d_i^+) \leq 1 \big), & \text{if } j \in \{1, \cdots, k\}, \\
\mathbb{1}\big( w^\top (d^-_{(j)} - d_i^+) \leq 0 \big), & \text{otherwise.}
\end{cases}
$$
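
The element-wise solution for the most violated ordering can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the function name and array shapes are hypothetical, and it assumes that the rank position $(j)$ of each impostor is obtained by sorting impostors by their current ensemble distance to the probe, closest first.

```python
import numpy as np

def most_violated_ordering(w, d_pos, d_neg, k):
    """Return P_bar (m x m') with 0/1 entries, in the original impostor order.

    d_pos: (m, T) base-metric distances to the true match.
    d_neg: (m, m', T) base-metric distances to each impostor.
    """
    # w^T (d_j^- - d_i^+) for every probe/impostor pair
    margins = (d_neg - d_pos[:, None, :]) @ w
    # Assumed ranking: impostors sorted by ensemble distance, closest first.
    # Sorting by margin is equivalent because w^T d_i^+ is constant per row.
    order = np.argsort(margins, axis=1, kind="stable")
    rank = np.empty_like(order)
    rows = np.arange(order.shape[0])[:, None]
    rank[rows, order] = np.arange(order.shape[1])[None, :]
    # Threshold 1 for the top-k positions, 0 for the remaining ones
    threshold = np.where(rank < k, 1.0, 0.0)
    return (margins <= threshold).astype(int)
```

Because the problem decouples over entries, the cost is dominated by the per-probe sort.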
Algorithm 1 Cutting-plane algorithm for solving coefficients of base metrics
(CMC^top)

Input:
  1) A set of base metrics of the same individual and different individuals,
     $\{d_i^+, d_j^-\}$;
  2) The regularization parameter, $\nu$;
  3) The cutting-plane termination threshold, $\epsilon$;
Output: The base metrics' coefficients $w$.
Initialize: The working set, $\mathcal{C} = \emptyset$;
  $g(S, P, w) = \Delta(P^*, P) - \frac{1}{m k} \sum_{i,j} p_{i,j} \, w^\top (d_j^- - d_i^+)$;
Repeat
  ① Solve the primal problem using linear SVM,
     $\min_{w, \xi} \ \frac{1}{2}\|w\|_2^2 + \nu \xi$  s.t.  $g(S, P, w) \leq \xi, \ \forall P \in \mathcal{C}$;
  ② Compute the most violated constraint,
     $\bar{P} = \arg\max_{P} \ g(S, P, w)$;
  ③ $\mathcal{C} \leftarrow \mathcal{C} \cup \{\bar{P}\}$;
Until $g(S, \bar{P}, w) \leq \xi + \epsilon$;

For CMC^triplet, one replaces $g(S, P, w)$ in Algorithm 1 with
$g(S, w) = 1 - \frac{1}{m(m-1)} \, w^\top \big[ \sum_{i,j} (d_j^- - d_i^+) \big]$
and repeats the same procedure.

In this section we assume that the base metrics,
$\{d_1(\cdot,\cdot), \cdots, d_T(\cdot,\cdot)\}$, are provided. In the next
section, we introduce two base metrics adopted in our proposed approaches.


2.2. Base metrics 


Metric learning can be divided into two categories: linear [?] and non-linear
methods. In the linear case, the goal is to learn a linear mapping by
estimating a matrix $M$ such that the distance between images of the same
individual, $(x_i - x_i^+)^\top M (x_i - x_i^+)$, is less than the distance
between images of different individuals,
$(x_i - x_{i,j}^-)^\top M (x_i - x_{i,j}^-)$. The linear method can be easily
extended to learn a non-linear mapping by kernelization [?]. The basic idea is
to learn a linear mapping in the feature space of some non-linear function,
$\phi$, such that the distance
$(\phi(x_i) - \phi(x_i^+))^\top M (\phi(x_i) - \phi(x_i^+))$ is less than the
distance $(\phi(x_i) - \phi(x_{i,j}^-))^\top M (\phi(x_i) - \phi(x_{i,j}^-))$.

Metric learning from equivalence constraints. The basic idea of KISS metric
learning (KISS ML) [17] is to learn the Mahalanobis distance by considering a
log likelihood ratio test of two Gaussian distributions. The likelihood ratio
test between dissimilar pairs and similar pairs can be written as

$$
r(x_i, x_j) = \log \frac{\frac{1}{C_d} \exp\big( -\frac{1}{2} x_{ij}^\top \Sigma_D^{-1} x_{ij} \big)}{\frac{1}{C_s} \exp\big( -\frac{1}{2} x_{ij}^\top \Sigma_S^{-1} x_{ij} \big)},
\tag{9}
$$

where $x_{ij} = x_i - x_j$, $C_d = \sqrt{2\pi|\Sigma_D|}$,
$C_s = \sqrt{2\pi|\Sigma_S|}$, and $\Sigma_D$ and $\Sigma_S$ are covariance
matrices of dissimilar pairs and similar pairs, respectively. By taking the
log and discarding constant terms, (9) can be simplified as

$$
r(x_i, x_j) = (x_i - x_j)^\top \big( \Sigma_S^{-1} - \Sigma_D^{-1} \big) (x_i - x_j).
\tag{10}
$$

Hence the Mahalanobis distance matrix $M$ can be written as
$\Sigma_S^{-1} - \Sigma_D^{-1}$. The authors of [17] clip the spectrum of $M$
by eigen-analysis to ensure $M$ is positive semi-definite. This simple
algorithm has been shown to perform surprisingly well on the person re-id
problem [?, 28].
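
The construction of $M$ in (10), followed by spectrum clipping, can be sketched as below. This is a minimal sketch under stated assumptions: the function name is hypothetical, pair differences are treated as zero-mean when estimating the covariances, and both covariance matrices are assumed invertible (for example after PCA).

```python
import numpy as np

def kiss_metric(sim_diffs, dis_diffs):
    """Estimate M = inv(Sigma_S) - inv(Sigma_D), clipped to be PSD.

    sim_diffs: (n_s, D) differences x_i - x_j of similar pairs.
    dis_diffs: (n_d, D) differences x_i - x_j of dissimilar pairs.
    """
    # Covariance of pair differences, assumed zero-mean
    cov_s = sim_diffs.T @ sim_diffs / len(sim_diffs)
    cov_d = dis_diffs.T @ dis_diffs / len(dis_diffs)
    M = np.linalg.inv(cov_s) - np.linalg.inv(cov_d)
    M = (M + M.T) / 2.0                      # remove numerical asymmetry
    vals, vecs = np.linalg.eigh(M)           # eigen-analysis of symmetric M
    vals = np.clip(vals, 0.0, None)          # drop negative eigenvalues
    return (vecs * vals) @ vecs.T            # rebuild the PSD matrix
```

The resulting distance between two feature vectors is `(a - b) @ M @ (a - b)`.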

Kernel-based metric learning. There exist several non-linear extensions to
metric learning. In this section, we introduce a recently proposed
kernel-based metric learning method, known as kernel Local Fisher Discriminant
Analysis (kLFDA) [38], which is a non-linear extension to the previously
proposed LFDA [26] and has demonstrated state-of-the-art performance on the
iLIDS, CAVIAR and 3DPeS data sets. The basic idea of kLFDA is to find a
projection matrix $M$ which maximizes the between-class scatter matrix while
minimizing the within-class scatter matrix using the Fisher discriminant
objective. Similar to LFDA, the projection matrix can be estimated using
generalized eigenvalues. Unlike LFDA, kLFDA represents the projection matrix
with the data samples in the kernel space $\phi(\cdot)$.

2.3. Visual features 

We introduce visual features which have been applied in 
our person re-id approaches. 

SIFT/LAB patterns. Scale-invariant feature transform (SIFT) has gained a lot
of research attention due to its invariance to scaling, orientation and
illumination changes [22]. The descriptor represents occurrences of gradient
orientation in each region. In this work, we combine discriminative SIFT with
color histograms extracted from the LAB colorspace.

LBP/RGB patterns. Local Binary Pattern (LBP) is another feature descriptor
that has received a lot of attention in the literature due to its
effectiveness and efficiency [25]. The standard version of 8-neighbour LBP has
a radius of 1 and is formed by thresholding the 3x3 neighbourhood with the
centre pixel's value. To improve the classification accuracy of LBP, we
combine LBP histograms with color histograms extracted from the RGB
colorspace.
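
The standard 8-neighbour, radius-1 LBP code described above can be sketched as follows. The function names, the clockwise bit order and the use of `>=` for the threshold are choices made for this illustration; the exact variant used in the experiments is not specified here.

```python
import numpy as np

def lbp_8neighbour(img):
    """LBP code for every interior pixel of a 2-D grayscale image."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Neighbour offsets (dy, dx), clockwise from the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit where the neighbour is at least as bright as the centre
        codes |= (neighbour >= center).astype(np.uint8) * np.uint8(1 << bit)
    return codes

def lbp_histogram(img):
    """256-bin histogram of LBP codes (raw counts)."""
    return np.bincount(lbp_8neighbour(img).ravel(), minlength=256)
```

Such a histogram would then be concatenated with a color histogram to form an LBP/RGB-style descriptor.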

Region covariance patterns. Region covariance is another texture descriptor
which has shown promising results in texture classification [?]. The
covariance descriptor is extracted from the covariance of several image
statistics inside a region of interest [?]. The covariance matrix provides a
measure of the relationship between two or more sets of variates. The diagonal
entries of covariance matrices represent the variance and the non-diagonal
entries represent the correlation value between low-level features.
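
As an illustration of the descriptor, the sketch below computes a region covariance matrix from five per-pixel statistics. The choice of statistics (pixel coordinates, intensity and absolute first-order gradients) and the function name are assumptions made for this example; the text above does not list the statistics actually used.

```python
import numpy as np

def region_covariance(patch):
    """5x5 covariance of per-pixel statistics inside a grayscale patch."""
    patch = np.asarray(patch, dtype=float)
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]              # pixel coordinates
    gy, gx = np.gradient(patch)              # derivatives along rows, columns
    # Assumed feature set: x, y, intensity, |dI/dx|, |dI/dy|
    feats = np.stack([xs, ys, patch, np.abs(gx), np.abs(gy)], axis=0)
    # Rows are variables, columns are pixels
    return np.cov(feats.reshape(5, -1))
```

The diagonal holds the variance of each statistic and the off-diagonal entries hold their pairwise covariances.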

Neural patterns. Large amounts of available training data and increasing
computing power have led to the recent success of deep convolutional neural
networks (CNN) in a large number of computer vision applications. CNNs exploit
the strong spatially local correlation present in natural images by enforcing
a local connectivity pattern between neurons of adjacent layers. In the deep
CNN architecture, convolutional layers are placed alternately between
max-pooling and contrast normalization layers [?].

Implementation. See the supplementary material for implementation details.

3. Experiments 

Datasets. There exist several challenging benchmark data sets for person
re-identification. In this experiment, we select four commonly used data sets
(iLIDS, 3DPeS, PRID2011, VIPeR) and two recently introduced data sets with a
large number of individuals (CUHK01 and CUHK03). The iLIDS data set has 119
individuals captured from eight cameras with different viewpoints [42]. The
number of images for each individual varies from 2 to 8. The data set contains
large occlusions caused by people and luggage. The 3DPeS data set is designed
mainly for people tracking and person re-identification [?]. It contains
numerous video sequences taken from a real surveillance environment with eight
different surveillance cameras and consists of 192 individuals. The number of
images for each individual varies from 2 to 26. The Person RE-ID 2011
(PRID2011) data set consists of images extracted from multiple person
trajectories recorded from two static surveillance cameras [?]. Camera view A
contains 385 individuals and camera view B contains 749 individuals, with 200
of them appearing in both views. Hence, there are 200 person image pairs in
the data set.

VIPeR is one of the most popular data sets for person re-identification [?].
It contains 632 individuals taken from two cameras with arbitrary viewpoints
and varying illumination conditions. The CUHK01 data set contains 971 persons
captured from two camera views in a campus environment [?]. Camera view A
captures the frontal or back view of the individuals while camera view B
captures the profile view. Finally, the CUHK03 data set consists of 1360
persons taken from six cameras [20]. The data set consists of manually cropped
pedestrian images and images cropped by the pedestrian detector of [?]. Due to
imperfections in the pedestrian detector, which cause some misalignments of
cropped images, we use the manually cropped images.

Evaluation protocol. In this paper, we adopt a single-shot experiment setting,
similar to [?]. For all data sets except CUHK03, all the individuals in the
data set are randomly divided into two subsets so that the training set and
the test set each contain half of the available individuals, with no overlap
in person identities. For data sets with two cameras, we randomly select one
image of the individual taken from camera view A as the probe image and one
image of the same individual taken from camera view B as the gallery image.
For multi-camera data sets, two images of the same individual are chosen: one
is used as the probe image and the other as the gallery image. For CUHK03, we
set the number of individuals in the train/test split to 1260/100 as conducted
in [20]. To be more specific, there are 59, 96, 100, 316, 485 and 100
individuals in each of the test splits for the iLIDS, 3DPeS, PRID2011, VIPeR,
CUHK01 and CUHK03 data sets, respectively. The number of probe images (test
phase) is equal to the number of gallery images in all data sets except
PRID2011, in which the number of probe images is 100 and the number of test
gallery images is 649 (all images from camera view B except the 100 training
samples). This procedure is repeated 10 times and the average of cumulative
matching characteristic (CMC) curves across the 10 partitions is reported. The
CMC curve provides a ranking for every image in the gallery with respect to
the probe.
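
For readers who want to reproduce the evaluation measure, the following sketch computes a CMC curve from a probe-by-gallery distance matrix. It is an illustrative helper, not the authors' evaluation code; it assumes that the true match of probe `i` is stored at gallery index `i`, with any additional gallery images placed after the matched ones.

```python
import numpy as np

def cmc_curve(dist):
    """Cumulative matching characteristic from a (n_probe, n_gallery) matrix.

    Entry r of the result is the fraction of probes whose true match
    appears among the r + 1 closest gallery images.
    """
    dist = np.asarray(dist, dtype=float)
    n_probe, n_gallery = dist.shape
    true_dist = np.diag(dist)[:, None]       # distance to the true match
    # 0-based rank of the true match: gallery images strictly closer than it
    ranks = (dist < true_dist).sum(axis=1)
    hits = np.bincount(ranks, minlength=n_gallery)
    return np.cumsum(hits) / n_probe
```

The first entry of the returned array is the rank-1 recognition rate; averaging the curves over the random partitions gives the reported CMC.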

Parameters setting. For the linear base metric (KISS ML [17]), we apply
principal component analysis (PCA) to reduce the dimensionality and remove
noise. Without performing PCA, it is computationally infeasible to invert the
covariance matrices of both similar and dissimilar pairs, as discussed in [?].
For each visual feature, we reduce the feature dimension to a 64-dimensional
subspace. For the non-linear base metric (kLFDA [38]), we set the
regularization parameter for the class scatter matrix to 0.01, i.e., we add a
small identity matrix to the class scatter matrix. For both SIFT/LAB and
LBP/RGB features, we apply the RBF-$\chi^2$ kernel. For region covariance and
CNN features, we apply the Gaussian RBF kernel
$k(x, x') = \exp(-\|x - x'\|^2 / \sigma^2)$. The kernel parameter is tuned to
an appropriate value for each data set. In this experiment, we set the value
of $\sigma^2$ to be the same as the first quantile of all distances [38].

For CMC^triplet, we choose the regularization parameter ($\nu$ in (3)) from a
logarithmically spaced set of candidate values by cross-validation on the
training data. For CMC^top, we choose the regularization parameter ($\nu$ in
(7)) in the same way. We set the cutting-plane termination threshold to
$10^{-6}$. The recall parameter $k$ is set to 10 for iLIDS, 3DPeS, PRID2011
and VIPeR and to 40 for the larger data sets (CUHK01 and CUHK03). Since the
success of metric learning algorithms often depends on the choice of good
parameters, we train multiple metric learners for each feature. Specifically,
for KISS ML, we reduce the feature dimensionality to 32, 48 and 64 dimensions
and use all three to learn the weight $w$ for CMC^triplet and CMC^top.
Similarly, for kLFDA, we set $\sigma^2$ to be the same as the 5th, the 10th
and the first quantile of all distances.
Figure 1: Performance comparison of base metrics with different visual
features: SIFT/LAB, LBP/RGB, covariance descriptor and CNN features. Rank-1
recognition rates are shown in parentheses. The higher the recognition rate,
the better the performance. Ours-Top (CMC^top) represents our ensemble
approach which optimizes the CMC score over the top k returned candidates.
Ours-Triplet (CMC^triplet) represents our ensemble approach which minimizes
the number of returned candidates such that the rank-k recognition rate is
equal to one.

[Figure panels: iLIDS, 3DPeS, PRID2011; x-axis: Rank]

Figure 2: Performance comparison of CMC^top with two different base metrics:
linear base metric (Linear Metric Learning) and non-linear base metric
(Non-lin. Metric Learning). On VIPeR, CUHK01 and CUHK03 data sets, an ensemble
of non-linear base metrics significantly outperforms an ensemble of linear
base metrics.
3.1. Evaluation and analysis

Feature evaluation. We investigate the impact of low-level and high-level
visual features on the recognition performance of person re-identification.
Fig. 1 shows the CMC performance of different visual features and their rank-1
recognition rates when trained with the kernel-based LFDA (non-linear metric
learning) on six benchmark data sets. On the VIPeR, CUHK01 and CUHK03 data
sets, we observe that both SIFT/LAB and LBP/RGB significantly outperform the
covariance descriptor and CNN features. This result is not surprising since
SIFT/LAB combines edge and color features while LBP/RGB combines texture and
color features. We suspect the use of color helps improve the overall
recognition performance of both features. We observe that CNN features perform
worse than hand-crafted low-level features in our experiments. We suspect that
the CNN pre-trained model has been designed for ImageNet object categories
[?], in which color information might be less important. However, on many
person re-id data sets, a large number of persons wear similar types of
clothing, e.g., t-shirt and jeans, but with different colors. Therefore color
information becomes an important cue for distinguishing two different
individuals. Overall, we observe that SIFT/LAB features perform consistently
well on all data sets evaluated.

Ensemble approach with different base metrics. Next
we compare the performance of our approach with two different
base metrics: linear metric learning and non-linear
metric learning [38] (introduced in Sec. 2.2). In this
experiment, we use CMC^top to learn an ensemble. Experimental
results are shown in Fig. 2. Two observations can be
made from the figure: 1) Both approaches perform similarly
when the number of train/test individuals is small, e.g., on
the iLIDS and 3DPeS data sets; 2) Non-linear base metrics
outperform linear base metrics when the number of individuals
increases. We suspect that there is less diversity when the
number of individuals is small, so that no further improvement
is observed when we replace linear base metrics with
non-linear base metrics.

Performance at different recall values. Next we compare
the performance of the proposed CMC^triplet with
CMC^top. Both optimization algorithms optimize the recognition
rate of person re-id but with different objective criteria.
We compare the performance of both algorithms with
the baseline approach, in which we simply set the weights of
all base metrics to a uniform value. Since distance functions
of different features have different scales, we normalize the
distances between each probe image and all images in the gallery
to lie between zero and one. In other words, we set the distance
between the probe image and the nearest gallery image to
zero and the distance between the probe image and the
furthest gallery image to one. The matching accuracy is
shown in Table 1. We observe that CMC^top achieves the
best recognition rate at small recall values.


[Plot panel: VIPeR data set; x-axis: Rank]
Figure 3: CMC performance for VIPeR and CUHK01 data
sets. The higher the recognition rate, the better the
performance. Our approach outperforms all existing person re-id
algorithms.

At a large recall value (rank > 50), both CMC^top and
CMC^triplet perform similarly. Interestingly, simple
averaging performs quite well on VIPeR, in which the number
of individuals in the test set is small.
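
The per-probe normalization and the uniform-weight (averaging) baseline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names and the toy distances are hypothetical, and each feature is assumed to provide one list of distances from a single probe to every gallery image.

```python
from typing import List, Sequence


def normalize_probe_distances(dists: Sequence[float]) -> List[float]:
    """Min-max scale one probe's distances to all gallery images into [0, 1].

    The nearest gallery image maps to 0 and the furthest to 1.
    """
    lo, hi = min(dists), max(dists)
    if hi == lo:  # degenerate case: all gallery images are equally distant
        return [0.0 for _ in dists]
    return [(d - lo) / (hi - lo) for d in dists]


def average_fusion(per_feature_dists: Sequence[Sequence[float]]) -> List[float]:
    """Uniform-weight baseline: mean of the normalized distances over features."""
    normalized = [normalize_probe_distances(d) for d in per_feature_dists]
    n_features = len(normalized)
    n_gallery = len(normalized[0])
    return [sum(f[j] for f in normalized) / n_features for j in range(n_gallery)]


# Hypothetical distances from one probe to four gallery images, for two
# features whose raw scales differ by orders of magnitude.
feature_a = [2.0, 4.0, 6.0, 10.0]
feature_b = [300.0, 100.0, 200.0, 500.0]

fused = average_fusion([feature_a, feature_b])
# Gallery indices sorted from best to worst match.
ranking = sorted(range(len(fused)), key=fused.__getitem__)
print(fused)    # [0.25, 0.125, 0.375, 1.0]
print(ranking)  # [1, 0, 2, 3]
```

Without the normalization step, the second feature would dominate the average simply because its raw distances are larger. A learned ensemble would replace the uniform mean with a weighted sum over the same normalized distances.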

3.2. Comparison with state-of-the-art results 

Fig. 3 compares our results with other person re-id
algorithms on two major benchmark data sets: VIPeR and
CUHK01. Our approach outperforms all of the compared
algorithms on both data sets. Next we compare our results with the
best reported results in the literature. The algorithm
proposed in [38] achieves state-of-the-art results on the iLIDS and
3DPeS data sets (40.3% and 54.2% recognition rate at rank-1,
respectively). Our approach outperforms [38] on
iLIDS (50.3%) and achieves a comparable result on 3DPeS
(53.3%). Zhao et al. propose mid-level filters for person
re-identification, which achieve state-of-the-art results
on the VIPeR and CUHK01 data sets (43.39% and 34.30%
recognition rate at rank-1, respectively). Our approach
outperforms the method of Zhao et al., achieving a recognition rate of 45.89%
      VIPeR                        CUHK01                       CUHK03
Rank  Avg.   CMC^triplet  CMC^top  Avg.   CMC^triplet  CMC^top  Avg.   CMC^triplet  CMC^top
1     44.9   45.7         45.9     51.9   53.0         53.4     57.4   60.5         62.1
2     58.3   59.6         60.2     63.3   64.1         64.3     71.7   73.5         76.6
5     76.3   77.1         77.5     75.1   76.1         76.4     85.9   87.8         89.1
10    88.2   88.9         88.9     83.0   84.0         84.4     93.1   93.5         94.3
20    94.9   95.7         95.8     89.4   90.7         90.5     96.9   97.4         97.8
50    99.4   99.5         99.5     95.9   96.4         96.4     99.5   99.7         99.7
100   99.9   100.0        100.0    98.6   98.6         98.6     100.0  100.0        100.0

Table 1: Re-id recognition rate (%) at different recall (rank). The best result is shown in boldface. Both CMC^top and
CMC^triplet achieve similar performance when retrieving > 50 candidates.


and 53.40% on the VIPeR and CUHK01 data sets, respectively.
Table 2 compares our results with other state-of-the-art
methods on other person re-identification data sets.
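
The rank-k recognition rates reported in this section are points on the CMC curve. The following is a minimal sketch of how such a curve can be computed from a probe-gallery distance matrix. It assumes the single-shot setting (exactly one correct gallery image per probe) and resolves ties optimistically; `cmc_curve` and the toy distances are hypothetical and are not taken from the experiments.

```python
from typing import List, Sequence


def cmc_curve(
    dist: Sequence[Sequence[float]],
    true_match: Sequence[int],
    max_rank: int,
) -> List[float]:
    """Return the recognition rate (%) at ranks 1..max_rank.

    dist[i][j] is the distance from probe i to gallery image j;
    true_match[i] is the gallery index of the correct match for probe i.
    """
    hits = [0] * max_rank
    for row, target in zip(dist, true_match):
        # Rank of the correct match = 1 + number of gallery images strictly closer.
        rank = 1 + sum(1 for d in row if d < row[target])
        # A probe matched at rank r counts as a hit for every k >= r.
        for k in range(rank, max_rank + 1):
            hits[k - 1] += 1
    return [100.0 * h / len(dist) for h in hits]


# Toy example: four probes, four gallery images, probe i matches gallery i.
toy_dist = [
    [0.1, 0.5, 0.9, 0.7],  # correct match ranked 1st
    [0.4, 0.3, 0.2, 0.8],  # correct match ranked 2nd
    [0.6, 0.2, 0.7, 0.9],  # correct match ranked 3rd
    [0.3, 0.6, 0.8, 0.1],  # correct match ranked 1st
]
curve = cmc_curve(toy_dist, true_match=[0, 1, 2, 3], max_rank=4)
print(curve)  # [50.0, 75.0, 100.0, 100.0]
```

The first entry of the returned list is the rank-1 recognition rate, the quantity quoted when comparing against previously reported results.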

4. Conclusion 

In this paper, we present an effective structured learning
based approach for person re-id by combining multiple low-level
and high-level visual features into a single framework.
Our approach is practical for real-world applications since
the performance can be concentrated in the range of most
practical importance. Moreover, our proposed approach is
flexible and can be applied to any metric learning algorithm.
Experimental results demonstrate the effectiveness of
the proposed approach on six major person re-id data sets.